Abstract

Background: In recent years, both single-nucleotide polymorphism (SNP) arrays and functional magnetic resonance
imaging (fMRI) have been widely used in the study of schizophrenia (SCZ). In addition, a few studies have
integrated SNP data and fMRI data for comprehensive analysis.

Methods: In this study, a novel sparse representation based variable selection (SRVS) method is proposed
and tested on a simulated data set to demonstrate its multi-resolution properties. The SRVS method was then
applied to an integrative analysis of two different SCZ data sets, a single-nucleotide polymorphism (SNP) data set
and a functional magnetic resonance imaging (fMRI) data set, including 92 cases and 116 controls. Biomarkers for
the disease were identified and validated with a multivariate classification approach followed by leave-one-out
(LOO) cross-validation. We then compared the results with those of a previously reported sparse representation
based feature selection method.

Results: Biomarkers from our proposed SRVS method gave significantly higher classification
accuracy in discriminating SCZ patients from healthy controls than those from the previously reported sparse
representation method. Furthermore, using biomarkers from both data sets led to better classification accuracy
than using a single type of biomarker, which suggests the advantage of integrative analysis of different types of
data.

Conclusions: The proposed SRVS algorithm is effective in identifying significant biomarkers for complex diseases
such as SCZ. Integrating different types of data (e.g. SNP and fMRI data) may identify complementary biomarkers
that benefit the diagnostic accuracy for the disease.



Background 

Schizophrenia (SCZ) is one of the most disabling and
emotionally devastating illnesses. The global median lifetime
morbid risk for schizophrenia is 7.2/1,000 persons
[1]. Genetic factors play an important role in the development
of schizophrenia. To date, over 1000 genes have
been reported to be associated with SCZ
(http://www.szgene.org/default.asp) and many SNPs have been identified as
biomarkers for the disease [2-4]. For example, Kordi-Tamandani
et al. showed that promoter methylation of
the CTLA4 gene can increase the risk of SCZ [2].
Shayevitz et al. confirmed the gene NOTCH4 as a candi- 
date gene for schizophrenia with genome-wide association 
studies (GWAS) [3]. Chen et al. stated that three SNPs
spanning the MYO5B gene are significantly associated with
SCZ: rs4939921, rs1557355 and rs4939924 [4]. Besides
genomic data, fMRI is another widely used data modality in
SCZ studies [5, 6]. To date, many methods have been proposed
to integrate multiple types of data in SCZ studies
[7-11]. For example, Chen et al. proposed parallel independent
component analysis (paraICA) to identify genomic
risk components associated with brain function abnormal- 
ities and detected significant biomarkers from both fMRI 
data and SNP data that are strongly correlated [7]. Parallel 
ICA is an effective method for the joint analysis of multiple 
modalities including interconnections between them [8].
Utilizing this method, Meda et al. detected three fMRI 
components significantly correlated with two distinct gene 
components in an SCZ study [11]. In this study, we propose a novel sparse
representation based variable selection (SRVS) method
and apply it to an integrative analysis of two types
of data, fMRI and SNP, aiming at a more comprehensive
analysis.

Sparse representation, including compressive sensing, has
been widely used in signal/image processing and computational
mathematics [12-18]. Candes et al. showed that
a signal can be stably and approximately recovered from
incomplete and inaccurate measurements [14]. Wright et al.
proposed sparse representation based classification (SRC) for
face recognition, demonstrating high classification
accuracy [15]. In our recent works [16-18], we developed novel
classification and feature selection algorithms based on 
sparse representation theory. We applied those methods 
to gene expression data analysis [16], to chromosome 
image classification [18], and to joint analysis of different 
data modalities (e.g. SNP data and gene expression data) 
[17], and achieved improved classification accuracies as 
well as better feature selections. 

In applications of sparse representation (e.g., feature
selection and signal recovery), the limited number of
available samples is an important issue [19-21]. According
to compressive sensing theory (e.g., the restricted isometry
property (RIP) condition [23, 24] for signal recovery), the
number of available samples should not be less than the
number of signals to be selected/recovered. However, the
number of features/variables in genomic data (e.g. SNP
data) or medical imaging data (e.g. fMRI data) is usually
much larger than the number of samples. In those cases,
the traditional methods for compressive sampling cannot
effectively analyse the data.

In a recent work, Li et al. [21] developed a voxel selection
algorithm for fMRI data analysis. The method is based
on sparse representation and is designed to obtain a sparse
solution when sufficient samples exist. However, it may
not handle the small sample problem described above.

In this study, a novel sparse representation based variable
selection (SRVS) algorithm is proposed to select
relevant biomarkers from big data sets having small sample
sizes. The selection is carried out with a window
based approach, in which the window size determines the
resolution of the variable selection. We first tested the SRVS algorithm
on a simulated data set (size of 100 × 1e6, with 50 cases
and 50 controls), demonstrating the multi-resolution
characteristic of the method. Then the algorithm was
applied to an integrative analysis of two real data sets: a
SNP data set (size of 208 × 759075) and an fMRI data set
(size of 208 × 153594). Using the proposed SRVS algorithm,
biomarkers for SCZ were identified and validated.

Methods 

fMRI and SNP data collection 

A total of 208 subjects, after signing informed consent, 
were recruited in the study, including 96 SCZ cases (age: 
34 ± 11, 74 males) and 112 healthy controls (age: 32 ± 11, 
68 males). Both SNP and fMRI data were collected from 
each of those 208 subjects. The healthy controls have no 
history of psychiatric disorders and were free of any medi- 
cal. SCZ cases met the DSM-IV diagnostic criteria for 
schizophrenia. After pre-processing, 153594 fMRI voxels
and 759075 SNP loci were obtained for the subsequent
biomarker selection. Please refer to [22] for a detailed
description of data collection and pre-processing.

Generalized sparse model 

To combine different data sets for integrative analysis, 
we consider the following model: 



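$$y = [\alpha_1 X_1, \alpha_2 X_2] \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} + \varepsilon = X\delta + \varepsilon \qquad (1)$$

where $y \in \mathbb{R}^{n \times 1}$ is the phenotype vector of the subjects;
the matrices $X_1 \in \mathbb{R}^{n \times p_1}$ and $X_2 \in \mathbb{R}^{n \times p_2}$ represent data sets of
different modalities having normalized column vectors
(e.g., $\|x\|_2 = 1$); $X = [\alpha_1 X_1, \alpha_2 X_2] \in \mathbb{R}^{n \times p}$; $\alpha_1 + \alpha_2 = 1$,
and $\alpha_1, \alpha_2 > 0$ are the weight factors for $X_1$ and $X_2$
respectively. The measurement error $\varepsilon \in \mathbb{R}^{n \times 1}$. We aim to
reconstruct the unknown sparse vector
$\delta = \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} \in \mathbb{R}^{p \times 1}$
based on $y$ and $X$, where $\delta_1 \in \mathbb{R}^{p_1 \times 1}$, $\delta_2 \in \mathbb{R}^{p_2 \times 1}$ and
$p_1 + p_2 = p$.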
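
The construction of the combined matrix $X$ in Eq. (1) can be illustrated with a minimal Python sketch. It assumes NumPy and uses small random matrices as stand-ins for the SNP and fMRI data; the function names `normalize_columns` and `combine_modalities` are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def normalize_columns(X):
    """Scale each column to unit L2 norm (all-zero columns are left unchanged)."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    return X / norms

def combine_modalities(X1, X2, alpha1):
    """Return X = [alpha1 * X1, alpha2 * X2] with alpha2 = 1 - alpha1."""
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie strictly between 0 and 1")
    alpha2 = 1.0 - alpha1
    return np.hstack([alpha1 * normalize_columns(X1),
                      alpha2 * normalize_columns(X2)])

# Toy stand-ins (assumption): 20 subjects, 50 "SNP" and 30 "fMRI" columns.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((20, 50))
X2 = rng.standard_normal((20, 30))
X = combine_modalities(X1, X2, alpha1=0.5)
```

After weighting, every column of the first block has L2 norm $\alpha_1$ and every column of the second block has norm $\alpha_2$, which is how the weight factors shift the selection towards one modality or the other.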

It can be proven that when $p > 35n$, the matrix
$X \in \mathbb{R}^{n \times p}$ can hardly satisfy the restricted isometry
property (RIP) condition [24] for signal recovery. In
this work, $p = 759075 + 153594 = 912669$ and $n = 208$.
Thus $p \gg 35n = 7280$. To overcome this problem, we
propose the SRVS algorithm described as follows.

SRVS algorithm 

To best approximate $y$ with the model given by Eq. (1), we
consider the following $L_p$ minimization problem:

$$\min \|\delta\|_p \quad \text{subject to} \quad \|y - X\delta\|_2 \le \varepsilon \qquad (2)$$

where $\|\cdot\|_p$ represents the $L_p$ norm, $p \in [0, 1]$. The SRVS
algorithm given below is used to solve the $L_p$ minimization
problem and to select the phenotype relevant column
vectors out of $X$.



Cao et al. BMC Medical Genomics 2013, 6(Suppl 3):S2
http://www.biomedcentral.com/1755-8794/6/S3/S2






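Sparse representation based variable selection (SRVS)
algorithm (http://hongbaocao.weebly.com/software-for-download.html)

1. Initialize $\delta^{(0)} = 0$;

2. For the $l$-th step, randomly select $X_l \in \mathbb{R}^{n \times k}$ from
the columns of $X \in \mathbb{R}^{n \times p}$; mark the indexes of the selected columns
as $I_l \in \mathbb{R}^{1 \times k}$;

3. Solve Eq. (3) to get $\delta_l \in \mathbb{R}^{k \times 1}$:

$$\min \|\delta_l\|_p \quad \text{s.t.} \quad \|y - X_l \delta_l\|_2 \le \varepsilon \qquad (3)$$

4. Update $\delta^{(l)} \in \mathbb{R}^{p \times 1}$ with $\delta_l$: $\delta^{(l)}(I_l) = \delta^{(l-1)}(I_l) + \delta_l$,
where $\delta^{(l)}(I_l)$ and $\delta^{(l-1)}(I_l)$ denote the $I_l$-th entries in $\delta^{(l)}$
and $\delta^{(l-1)}$ respectively;

5. If $\|\delta^{(l)}/l - \delta^{(l-1)}/(l-1)\|_2 > \alpha$, update $l = l + 1$ and go to
Step 2;

6. Set $\delta = \delta^{(l)}/l$. The non-zero entries in $\delta$ correspond
to the columns in $X$ to be selected.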

In Step 3, we solve an $L_0$ minimization problem
using the orthogonal matching pursuit (OMP) algorithm [19]. OMP has been
widely used for signal recovery and approximation [18],
[26-30].
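
The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes NumPy, uses a plain greedy OMP written here for self-containment, draws windows with `rng.choice` instead of an explicit Fisher-Yates shuffle, and introduces the hypothetical parameters `tol` (the convergence threshold of Step 5) and `max_iter` (an iteration cap that the algorithm description does not mention). The toy data at the end are simulated.

```python
import numpy as np

def omp(X, y, eps):
    """Orthogonal matching pursuit: add columns greedily until ||y - X d||_2 <= eps."""
    n, k = X.shape
    support, coef = [], np.zeros(0)
    residual = y.copy()
    while np.linalg.norm(residual) > eps and len(support) < min(n, k):
        scores = np.abs(X.T @ residual)
        if support:
            scores[support] = -1.0          # never pick the same column twice
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    delta = np.zeros(k)
    if support:
        delta[support] = coef
    return delta

def srvs(X, y, k, eps, tol=1e-3, max_iter=150, seed=0):
    """Average window-wise OMP solutions over randomly drawn column windows."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    total = np.zeros(p)                      # running sum delta^(l) (Step 1)
    previous = None
    for l in range(1, max_iter + 1):
        window = rng.choice(p, size=k, replace=False)    # Step 2
        total[window] += omp(X[:, window], y, eps)       # Steps 3 and 4
        current = total / l
        if previous is not None and np.linalg.norm(current - previous) <= tol:
            break                                        # Step 5: change below tol
        previous = current
    return total / l                                     # Step 6

# Toy data (assumption): 100 samples, 300 unit-norm columns, three generate y.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 300))
X /= np.linalg.norm(X, axis=0)
true_cols = [10, 150, 290]
y = X[:, true_cols] @ np.array([4.0, -4.0, 4.0]) + 0.01 * rng.standard_normal(100)
delta = srvs(X, y, k=60, eps=0.3 * np.linalg.norm(y))
top = np.argsort(-np.abs(delta))[:3]         # indexes of the largest entries
```

Ranking the columns by the magnitude of the averaged vector `delta` gives the variable ordering that the sparsity controls described in the following sections act on.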

It can be proven that, by using the SRVS algorithm,
one can identify the significant variables with high
probability. In addition, the SRVS algorithm can be shown
to converge for any given $k$ and $\varepsilon$, generating an effective
solution for the sparse model specified by Eq. (2). In the
following section, we discuss the sparsity control issue
to determine the number of variables to be selected.

Sparsity control using k 

In Step 2 of the SRVS algorithm, we use the Fisher-Yates
shuffling algorithm [31] with a window of length
$k$ to select $X_l \in \mathbb{R}^{n \times k}$ from $X \in \mathbb{R}^{n \times p}$. The length $k$
determines the resolution of the SRVS algorithm. When
$k = p$, the number of variables selected will generally be
equal to the sample number $n$ [23]. The smaller the $k$,
the more variables are selected, and those variables
generally include the variables selected with a bigger $k$, as
shown in Figure 1. This multi-resolution property
enables us to select different numbers of variables at
different significance levels.
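
The window selection of Step 2 can be sketched with a partial Fisher-Yates shuffle, which fixes only the first $k$ positions of the index list. The snippet is an illustrative sketch using the Python standard library; the function name `fisher_yates_window` and the toy sizes are assumptions, not part of the original software.

```python
import random

def fisher_yates_window(p, k, rng):
    """Return k distinct column indexes drawn uniformly at random from range(p)."""
    if not 0 < k <= p:
        raise ValueError("window length k must satisfy 0 < k <= p")
    indexes = list(range(p))
    for i in range(k):                   # only the first k positions are needed
        j = rng.randrange(i, p)          # uniform over the not-yet-fixed tail
        indexes[i], indexes[j] = indexes[j], indexes[i]
    return indexes[:k]

rng = random.Random(0)
window = fisher_yates_window(p=1000, k=20, rng=rng)
```

Stopping the shuffle after $k$ swaps keeps the cost of each draw proportional to $k$ once the index list exists, which matters when $p$ is close to one million.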

Figure 1 Diagram for the sparsity control using k in the SRVS
method. p is the total number of columns/variables. The results
were generated with a white noise simulation data set (size 100 ×
1e6; 50 cases, 50 controls and 1e6 variables).

Further sparsity control using ε

The parameter $\varepsilon$ given in Eq. (2) can be used for further
sparsity control. The magnitudes of the entries of $\delta$ reflect
the significance of the corresponding columns of $X$ [21].
Thus, a threshold can be selected for $\delta$ using cross-validation
[32]. Another way to determine a threshold is to use
the error term $\varepsilon$ (as shown in Figure 2), which reflects
the residual $\|y - X\delta\|_2$ [20]. When $\varepsilon = 0$, noise may be
involved in the columns selected [20]. In this study, we
set $\varepsilon = r\|y\|_2$. Figure 2 shows that if the first
400 variables with amplitudes larger than 0.002 are
selected (i.e. the point (400, 0.002) on the 'Regression
coefficients' curve), this corresponds to the point (400, 0.4) on
the 'Error term coefficient' curve, which indicates that with
these 400 variables the error term is $\varepsilon = 0.4\|y\|_2$.
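
The relation between the number of top-ranked variables and the error term coefficient $r$ can be sketched as below. The text does not state how the residual for the top $m$ variables is computed, so this sketch assumes a least-squares refit on the $m$ columns with the largest $|\delta|$; the function name `error_term_curve` and the random toy data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def error_term_curve(X, y, delta, sizes):
    """For each m in sizes, return r = ||y - X_S c||_2 / ||y||_2, where S holds the
    m columns with the largest |delta| and c is their least-squares fit."""
    order = np.argsort(-np.abs(delta))       # descending amplitude
    y_norm = np.linalg.norm(y)
    ratios = []
    for m in sizes:
        cols = order[:m]
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        ratios.append(float(np.linalg.norm(y - X[:, cols] @ coef) / y_norm))
    return ratios

# Toy data (assumption): random design, +/-1 phenotype, random ranking vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 200))
y = np.where(np.arange(30) < 15, 1.0, -1.0)
delta = rng.standard_normal(200)
ratios = error_term_curve(X, y, delta, sizes=[1, 5, 10, 20])
```

Because the column sets are nested, the ratio can only stay the same or decrease as $m$ grows, so a target value of $r$ maps to a minimum number of variables to keep.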

Validation 

To validate the variables selected using our proposed
SRVS algorithm, we compared our selected SNPs and
fMRI voxels with those of previous studies. In addition, we
used the selected SNPs and fMRI voxels to distinguish SCZ
patients from healthy controls with the sparse representation
based classifier (SRC) [15] [18]. A leave-one-out
(LOO) cross-validation approach was then carried out to
evaluate the identification accuracy. We compared the
classification results with those of Li et al.'s method [21].




Figure 2 Diagram for further sparsity control using ε.
$\varepsilon = r\|y\|_2$; the entries of $\delta$ were sorted in descending order by
amplitude (horizontal axis: index/number of selected variables). The results
were generated with a white noise simulation data set of size 100 × 1e6
(50 cases and 50 controls) with k = 0.02 × 1e6 = 2e4.
Furthermore, we compared the results of using variables
selected from one type of data with those of using both types of
data. We also studied the influence of selecting different
numbers of variables.
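
The validation procedure can be sketched as a leave-one-out loop around a sparse representation based classifier. The sketch below is illustrative only: it assumes NumPy, uses a fixed-size OMP as the sparse solver inside the classifier, and replaces the variable selection step with a hypothetical stand-in, `top_mean_difference`, so that the example stays short. None of these helper names come from the paper.

```python
import numpy as np

def omp(D, x, n_atoms):
    """Orthogonal matching pursuit with a fixed number of atoms."""
    support, residual, coef = [], x.copy(), np.zeros(0)
    for _ in range(min(n_atoms, D.shape[1])):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -1.0
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

def src_predict(train, labels, x, n_atoms=5):
    """Assign x to the class whose training samples best reconstruct it."""
    D = train.T / np.linalg.norm(train, axis=1)   # columns = unit-norm samples
    code = omp(D, x, n_atoms)
    residuals = {}
    for c in np.unique(labels):
        part = np.where(labels == c, code, 0.0)   # keep class-c coefficients only
        residuals[c] = np.linalg.norm(x - D @ part)
    return min(residuals, key=residuals.get)

def loo_accuracy(features, labels, select, n_atoms=5):
    """Leave-one-out accuracy; variables are selected on the training part only."""
    hits = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i
        cols = select(features[keep], labels[keep])
        pred = src_predict(features[keep][:, cols], labels[keep],
                           features[i, cols], n_atoms)
        hits += int(pred == labels[i])
    return hits / len(labels)

def top_mean_difference(features, labels, m=3):
    """Hypothetical stand-in selector: m columns with the largest class-mean gap."""
    gap = np.abs(features[labels == 1].mean(axis=0) - features[labels == 0].mean(axis=0))
    return np.argsort(-gap)[:m]

rng = np.random.default_rng(0)
labels = np.array([0] * 12 + [1] * 12)
features = rng.standard_normal((24, 40))
features[labels == 1, :3] += 3.0                  # toy class signal
accuracy = loo_accuracy(features, labels, top_mean_difference, n_atoms=2)
```

Running the selector inside the loop, on the training samples only, keeps the held-out sample from influencing which variables are chosen.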

Results

We applied our SRVS method with the sparse model given
by Eq. (1) to an integrative analysis of SNP and fMRI data
sets. The results were compared with those of Li et al.'s
method under different weighting factors. We also
discuss the sparsity control issues using $k$ and $\varepsilon$.

Variable selection with different weight factors 

The sparse model given by Eq. (1) was solved with different weight
factors by our proposed SRVS method and by
Li et al.'s method, respectively, as shown in Figure 3. It
can be seen that at the two ends ($\alpha_1 = 0.3$ or 0.6), the
variables were selected from one type of data.

In each of the 16 trials shown in Figure 3, we selected the
top 200 biomarkers by our proposed SRVS method and by
Li et al.'s method [21]. As shown in Figure 3, the weight
factor has similar effects on the variable selection of the
two methods. Interestingly, even though
the number of SNPs was much larger than that of fMRI
voxels (759075 vs. 153594), similar numbers of variables
were selected from both data sets when the weight factor $\alpha_1$
for the SNP data set was around 0.5 (0.46 for the SRVS method
with the $L_0$ norm, and 0.47 for Li et al.'s method). This
suggests that the two data sets may contain similar
information for the SCZ case/control study.
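
The 16 trials correspond to the weight factor grid from 0.3 to 0.6 in steps of 0.02, and the per-modality counts follow from the column index of each selected variable. A short sketch in plain Python, with hypothetical helper names `weight_grid` and `split_by_modality`, and assuming the SNP columns come first in $X$ as in Eq. (1):

```python
def weight_grid(start=0.30, stop=0.60, step=0.02):
    """Weight factors alpha_1 from start to stop (inclusive), rounded to 2 decimals."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

def split_by_modality(selected, p1):
    """Count selected column indexes that fall in X1 (index < p1) and in X2."""
    n_first = sum(1 for j in selected if j < p1)
    return n_first, len(selected) - n_first

alphas = weight_grid()
# Illustrative indexes only: two below and two at or above p1 = 759075.
counts = split_by_modality([0, 5, 759075, 800000], p1=759075)
```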

Comparison with Li et al.'s method 

We selected 200 variables (SNPs and fMRI voxels) in each
trial by our proposed SRVS method and by Li et al.'s
method respectively, as shown in Figure 3. However,
further study showed that the variables selected by the two
methods were significantly different (overlap <10%) (see
Figure 4). Thus it was necessary to validate and compare
the different groups of variables selected. We first
compared the selected SNPs and the corresponding genes with
the publicly reported SCZ genes for both methods. Then
we compared the brain regions identified using the two
methods. In addition, we compared the classification
accuracies using the variables selected by our proposed
SRVS method and Li et al.'s method.
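
The overlap between two selections can be quantified in several ways, and the text does not state which definition was used. The sketch below assumes one simple choice, the number of shared variables divided by the size of the smaller selection; the function name and the index ranges are illustrative only.

```python
def overlap_fraction(selected_a, selected_b):
    """Shared items divided by the size of the smaller selection (assumed definition)."""
    a, b = set(selected_a), set(selected_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Two made-up selections of 200 column indexes sharing 10 entries.
fraction = overlap_fraction(range(0, 200), range(190, 390))
```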

We compared the selected genes with the top genes reported (see 'Top
45 SCZ genes' in Additional file 1). For the 16 trials
with the top 200 variables selected in each trial, our proposed
SRVS method and Li et al.'s method each identified a
different set of 4 reported genes, as shown in Table 1. It should be
noted that even though both methods can identify the gene
'OPCML', they recognized the gene through different
SNPs (SRVS through 'rs3026883' and Li et al.'s method through
'rs1745939').

To further compare the two methods at different sparsity
levels, we studied more top variables in each of the 16
trials. For this purpose, we set $\varepsilon = 0.3\|y\|_2$ and k = 0.05
for the SRVS method. For Li et al.'s method [21], the number of
subjects selected in each run was one tenth of the total number
of subjects, and we set the threshold $\theta = 0.01$ (please
refer to [21] for the meaning of $\theta$). As a consequence, 500
to 800 variables (SNPs and fMRI voxels) were selected in
each trial. In this case, our proposed method selected
20 reported genes. For Li et al.'s method, 14 reported
genes were located, and 11 of the top 45 genes were
identified by both methods [22]. However, the genes identified
by the two methods have <10% overlap. Among the top
50 genes selected by the two methods, only one
gene, CSMD1, was identified by both methods. We list
the top 50 genes and the corresponding SNPs chosen by
the two methods in Additional file 2.

Figure 3 Variable selection with the sparse model using different methods. The 'Weight factor' in the plots refers to $\alpha_1$ (range of [0.3, 0.6];
step length = 0.02). (a) SRVS method with $L_0$ norm (b) Li et al.'s SLR method

Figure 4 Comparison of the variables (fMRI voxels/SNPs)
selected in the 16 trials by two different methods

When comparing the fMRI voxels selected (following the
approach shown in Figure 3), we found that the SRVS
method was capable of selecting fMRI voxels that were
clustered in specific regions, as shown in Figure 5 (a).
Voxels located within the same region are highly
correlated with each other. Therefore the results indicate
the capability of our proposed SRVS method in selecting
significant biomarkers that are highly correlated. Further
study showed that the brain regions selected by our proposed
SRVS method have mostly been reported as being associated
with SCZ [33-35], including the temporal lobe, lateral
frontal lobe, occipital lobe, and motor cortex (see Table 2).
However, Li et al.'s method tended to select voxels that
were scattered over different brain regions (see Figure 5
(b)). Besides, the brain regions selected by the two methods
were largely different from each other. Thus we used a
multivariate classification approach to evaluate the
effectiveness of the variables selected by the two methods.

Multivariate classification 

In this study, LOO cross-validation was carried out to
evaluate the classification accuracy. In each run of the
LOO validation, one sample was used for testing while the
remaining samples were used for variable selection. The results are
presented in Figure 6. Our proposed SRVS
algorithm provided significantly higher classification ratios
(CRs) (p-value < 1e-11) for both the 200-selected-variable
case and the 800-selected-variable case. However,
using different numbers of top selected variables showed no
significant difference for either of the two methods
(p-value > 0.1).

Figure 6 (a) shows that the highest classification
accuracy was achieved at the weight factor
$\alpha_1 = 0.5$, where approximately equal numbers of SNPs and fMRI
voxels were selected by the SRVS method. At the two ends
($\alpha_1 = 0.3$ or 0.6), the classification accuracies were
relatively lower. This suggests that using biomarkers from
both types of data may lead to better identification
accuracy.

Discussion 

In this study, we introduced a novel sparse representation
based variable selection (SRVS) method, and applied it to
an integrative analysis of SNP data and fMRI data. In the
case of medical imaging data (e.g. fMRI data) or genomic
data (e.g. SNP data), the number of samples tends to be
much smaller than the number of variables (e.g. fMRI voxels;
SNP loci). As a consequence, many of those variables are
correlated and cannot be identified by traditional sparse
signal recovery methods. The proposed SRVS method
can identify significant variables with high probability,
regardless of the coherence conditions required for exact
signal recovery in compressive sensing. For example,
significant fMRI voxels that are functionally correlated (within
neighbouring brain regions) were identified simultaneously
by using our proposed SRVS algorithm (see Figure 5 (a)).
This demonstrates the capability of our proposed SRVS
method in handling big data sets with small sample sizes.

In addition, the proposed SRVS method can be generalized
to integrate multiple data modalities for joint analysis
and to achieve a comprehensive diagnosis. As can be
seen from Figure 6 (a), the highest classification accuracy
was achieved using approximately equal numbers of variables
from both data sets, suggesting that using biomarkers
from both types of data may lead to higher diagnosis
accuracy.

Table 1 Comparison of the top SCZ genes selected by the two methods with the reported first 45 SCZ genes (http://www.szgene.org/default.asp). The Index is the order of the specific gene in the top 45 reported genes list.

| SRVS ($L_0$): Index | SRVS ($L_0$): Gene | SRVS ($L_0$): SNP | Li et al.'s method: Index | Li et al.'s method: Gene | Li et al.'s method: SNP |
|---|---|---|---|---|---|
| 6 | PDE4B | rs10846559 | 1 | PRSS16 | rs13399561 |
| 26 | NRG1 | rs12097254 | 11 | DAOA | rs16869700 |
| 35 | PLXNA2 | rs4811326 | 17 | RPP21 | rs1836942 |
| 37 | OPCML | rs3026883 | 37 | OPCML | rs1745939 |

Figure 5 A comparison of the fMRI voxels selected by using the two different methods. (a) Voxels selected using SRVS method (b) Voxels
selected using Li et al.'s method

Another advantage of the SRVS method is its multiple
detection resolutions. By choosing different values of the
window length $k$, one can select different numbers of variables
at different significance levels. Furthermore, the error
term $\varepsilon$ can be used for further sparsity control of the
solution $\delta$, selecting the most important variables. This
multi-resolution characteristic of SRVS provides a flexible
variable selection approach for big data sets.

When compared with previous SCZ studies, our
method identified more reported SCZ genes
than Li et al.'s method. Furthermore, most of the brain
regions identified using our proposed SRVS method have
previously been reported as SCZ-associated brain regions.
When using the selected variables to distinguish SCZ patients
from controls, our method generated a significantly higher
classification ratio than Li et al.'s method (Figure 6 (b),
p-value < 1e-11). These results demonstrate the
effectiveness of our method.

Table 2 Main brain regions of selected voxels using SRVS method

| Brain region | Left(L)/Right(R) aal | Selected voxels number |
|---|---|---|
| Precuneus | L/R | 51 |
| Precentral Gyrus | L/R | 35 |
| Sub-Gyral | L/R | 32 |
| Middle Frontal Gyrus | L/R | 26 |
| Middle Temporal Gyrus | L | 20 |
| Cuneus | R | 17 |
| Culmen | L/R | 16 |
| Paracentral Lobule | L | 16 |
| Lentiform Nucleus | L/R | 13 |
| Superior Temporal Gyrus | L/R | 13 |
| Declive | L/R | 13 |
| Cingulate Gyrus | * | 13 |
| Postcentral Gyrus | R | 9 |
| Medial Frontal Gyrus | R | 7 |
| Superior Frontal Gyrus | R | 7 |
| Anterior Cingulate | R | 7 |

Figure 6 A comparison of the multivariate classification using the two methods. (a) The classification accuracy of the two methods with
different variables selected; (b) the box plots of the classification accuracy. The 'Weight factor' in the plots refers to $\alpha_1$ in the range of [0.3, 0.6];
step length = 0.02. (a) CR by using SRVS method (b) CR by using Li et al.'s method

Conclusions 

Our proposed SRVS method is effective in variable selection for
complex diseases such as SCZ. The biomarkers selected give
better identification accuracy than those of Li et al.'s
method. When combining information from fMRI data
and SNP data for integrative analysis, higher identification
accuracy can be achieved, demonstrating the advantage of
the combined analysis.

Additional material 



Additional file 1: The top 45 schizophrenia genes reported 

Additional file 2: The top 50 genes and the corresponding SNPs
chosen by the two methods (the proposed SRVS method and Li et al.'s
method).